The goal of this assignment is to develop some intuition about the impact of the number of nodes in the hidden layer of a neural network. We will use a few simulated examples to gain a clear understanding of the structure of the data we are modeling and will assess how the performance of the neural network model is affected by the structure of the data and the setup of the network.
First of all, to compensate for the lack of coverage of this topic in ISLR, let’s go over a couple of simple examples. We start by simulating a simple two-class dataset in 2D predictor space with an outcome representative of an interaction between attributes.
# fix seed so that narrative always matches the plots:
set.seed(1234567890)
nObs <- 1000
ctrPos <- 2
# two predictors for 2*nObs observations:
xyTmp <- matrix(rnorm(4*nObs),ncol=2)
# cluster centers drawn at random from (+/-ctrPos, +/-ctrPos):
xyCtrsTmp <- matrix(sample(c(-1,1)*ctrPos,nObs*4,replace=TRUE),ncol=2)
xyTmp <- xyTmp + xyCtrsTmp
# class is determined by the interaction (product of signs) of the center coordinates:
gTmp <- paste0("class",(1+sign(apply(xyCtrsTmp,1,prod)))/2)
plot(xyTmp,col=as.numeric(factor(gTmp)),pch=as.numeric(factor(gTmp)),xlab="X1",ylab="X2")
abline(h=0)
abline(v=0)
Symbol color and shape indicate class. This is a typical problem that will defeat any approach estimating a single linear decision boundary. We used similar simulated data for the random forest assignment.
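As an illustrative aside (my addition, with hypothetical variable names), a model limited to a single linear boundary, such as plain logistic regression, performs near chance on this kind of interaction data, while the same model with the product term included separates the classes well:

```r
set.seed(1)
n <- 1000
# same XOR-type layout: centers at (+/-2, +/-2), class = sign of their product
ctrs <- matrix(sample(c(-2, 2), 2 * n, replace = TRUE), ncol = 2)
d <- data.frame(x1 = rnorm(n) + ctrs[, 1],
                x2 = rnorm(n) + ctrs[, 2],
                cls = (1 + sign(ctrs[, 1] * ctrs[, 2])) / 2)
# single linear boundary: training accuracy stays near 0.5
linFit <- glm(cls ~ x1 + x2, family = binomial, data = d)
mean((predict(linFit, type = "response") > 0.5) == d$cls)
# adding the interaction term recovers the structure
intFit <- glm(cls ~ x1 * x2, family = binomial, data = d)
mean((predict(intFit, type = "response") > 0.5) == d$cls)
```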
Simulate data with n=1000 observations and p=3 covariates – all random variables drawn from the standard normal distribution. Create a two-category class variable assigning all observations within a sphere of radius 1.5 centered at the 3D origin to one class category and all others to the second. Since you will be reusing this code in the following two problems it is probably best to turn this procedure into a function with appropriate parameters. Check that the resulting class assignments split these observations very roughly evenly between the two groups. Plot values of the resulting covariates projected on each pair of axes, indicating the classes to which observations belong with symbol color and/or shape (you can use the function pairs, for example). What is the smallest number of planes in 3D space that would completely enclose points from the “inner” class?
# note: setting the seed inside the function would make every simulated dataset
# identical (and an "independent" test set would partially reuse the same draws
# as the training set), so the seed is set once outside instead
simData <- function(n=1000, cols=3, inform=3, radius=1.5){
x <- matrix(rnorm(n*cols),ncol=cols)
#points within the inner sphere are set to class 2, outside = 1
cl <- as.numeric(factor(sqrt(rowSums(x[,1:inform]^2))<radius))
data.frame(CLASS=cl, x)
}
set.seed(1234)
d1 <- simData()
head(d1)
##   CLASS            X1            X2             X3
## 1     1 -1.2070657494 -1.2053334196 -0.97381861936
## 2     2  0.2774292421  0.3014667398 -0.09963119709
## 3     1  1.0844411767 -1.5391452236 -0.11073500011
## 4     1 -2.3456977026  0.6353707235  1.19219459601
## 5     1  0.4291246888  0.7029517748 -1.65588593019
## 6     1  0.5060558922 -1.9058828646 -1.04564329068
pairs(d1[2:4], main = "", pch=21,bg = c("red","green3")[d1$CLASS])
#3D visualisation
library(scatterplot3d)
sp <- with(d1, scatterplot3d(X1,X2,X3,color=CLASS,cex.symbols=0.8, pch=19,angle=20))
#4 planes (the faces of a tetrahedron) is the minimum required to completely enclose a sphere in 3D
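As a quick sanity check on the class balance (my addition): the squared distance of a 3D standard normal point from the origin follows a chi-squared distribution with 3 degrees of freedom, so the expected fraction of points inside the radius-1.5 sphere has a closed form:

```r
# theoretical fraction inside the sphere: P(||X||^2 < 1.5^2) for X ~ N(0, I3)
pchisq(1.5^2, df = 3)   # about 0.478, i.e. a very roughly even split
# empirical check on freshly simulated points
set.seed(42)
x <- matrix(rnorm(1000 * 3), ncol = 3)
mean(sqrt(rowSums(x^2)) < 1.5)
```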
For the dataset simulated above fit neural networks with 1 through 6 nodes in a single hidden layer (use the neuralnet implementation). For each of them calculate the training error (see the example in the Preface where it was calculated using the err.fct field in the result returned by neuralnet). Simulate another independent dataset (with n=10,000 observations to make the resulting test error estimates less variable) using the same procedure as above (3D, two classes, decision boundary as a sphere of radius 1.5) and use it to calculate the test error at each number of hidden nodes. Plot training and test errors as a function of the number of nodes in the hidden layer. What does the resulting plot tell you about the interplay between model error, model complexity and problem geometry? What is the geometrical interpretation of this error behavior?
#Create test dataset
test <- simData(n=10000)
train.errs <- numeric(6)
test.errs <- numeric(6)
library(neuralnet)
for (i in 1:6){
nn <- neuralnet(CLASS~X1+X2+X3,data=d1,err.fct="sse",hidden=i)
#plot(nn)
#neuralnet's "sse" error is half the sum of squared errors, so double it and
#divide by the number of observations to get an MSE comparable across the
#differently sized training and test sets
train.errs[i] <- 2*nn$result.matrix["error",1]/nrow(d1)
my.predict <- compute(nn, test[,2:4])$net.result
test.errs[i] <- sum((test$CLASS-my.predict)^2)/nrow(test)
}
#Reshape data for ggplot
library(tidyr)
library(ggplot2)
dd <- data.frame(rbind(train.errs,test.errs))
dd$lbl <- rownames(dd)
dx <- gather(dd,idx,err,X1:X6)
ggplot(dx,aes(idx,err,group=lbl)) + geom_line(aes(linetype=lbl, color=lbl)) + geom_point(aes(color=lbl)) + labs(x="Hidden Nodes", y="MSE") +
theme(legend.position="top")
#We can see that the MSE drops until 4 nodes and then levels off despite the further increase in model complexity. Geometrically, this matches the answer to problem 1: a minimum of 4 hyperplanes is needed to enclose a classification boundary that is a sphere.
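To make the geometric picture concrete (a sketch of my own, not required by the assignment): the four faces of a regular tetrahedron tangent to the radius-1.5 sphere form the smallest set of planes enclosing the inner class, and one can verify directly that the tetrahedron fully contains it:

```r
set.seed(42)
# outward unit normals of a regular tetrahedron (alternate corners of a cube)
nrm <- rbind(c(1, 1, 1), c(1, -1, -1), c(-1, 1, -1), c(-1, -1, 1)) / sqrt(3)
x <- matrix(rnorm(1000 * 3), ncol = 3)
inSphere <- sqrt(rowSums(x^2)) < 1.5
# inside the tetrahedron = on the inner side of all four planes at distance 1.5
inTetra <- apply(x %*% t(nrm) < 1.5, 1, all)
all(inTetra[inSphere])     # the tetrahedron fully contains the inner class
mean(inSphere == inTetra)  # fraction of points on which the two regions agree
```

Since every unit normal satisfies n·x ≤ ||x||, any point with ||x|| < 1.5 lies on the inner side of all four planes, so the first check is guaranteed to be TRUE.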
Set up a simulation repeating the procedure described above for n=100, 200 and 500 observations in the training set, as well as adding none, 1, 2 and 5 null variables to the training and test data (and to the covariates in the formula provided to neuralnet). Draw values for the null variables from the standard normal distribution as well and do not use them in the assignment of the observations to the class category (e.g. x<-matrix(rnorm(600),ncol=6); cl<-as.numeric(factor(sqrt(rowSums(x[,1:3]^2))<1.5)) creates a dataset with three informative and three null variables). Repeat the calculation of training and test errors at least several times for each combination of sample size, number of null variables and size of the hidden layer, simulating a new training and test dataset every time, to assess the variability in those estimates. Present the resulting error rates so that the effects of sample size and fraction of null variables can be discerned and discuss their impact on the resulting model fits.
dfTmp <- NULL
for (iRep in 1:3){ #several replicates per combination to assess variability of the estimates
for (ns in c(100,200,500)){
for (nulls in c(0,1,2,5)){
pop <- simData(n=ns, cols=3+nulls)
nm <- names(pop)
fm <- as.formula(paste("CLASS ~", paste(nm[!nm %in% "CLASS"], collapse = " + ")))
for (hidden in 1:6){
bTrain <- sample(c(FALSE,TRUE),nrow(pop),replace=TRUE)
train <- pop[bTrain,]
test <- pop[!bTrain,]
nn <- neuralnet(fm,data=train,err.fct="sse",hidden=hidden)
predict <- compute(nn, test[,-1])$net.result
#double the "sse" error (half the sum of squares) so the training MSE is comparable to the test MSE
dfTmp <- rbind(dfTmp,data.frame(train.err=2*nn$result.matrix["error",1]/nrow(train),
test.err=sum((test$CLASS-predict)^2)/nrow(test),
hnodes=hidden,
nulls=nulls,
n=ns))
}
}
}
}
dx <- gather(dfTmp, err.type, err.val, train.err:test.err, factor_key=TRUE)
ggplot(dx,aes(x=as.factor(hnodes),y=err.val,shape=err.type,colour=as.factor(n)))+geom_point()+facet_wrap(~nulls)+labs(x="Hidden Nodes",y="Error",colour="n")
#We can see that increasing the number of null covariates has a large effect on the test error when n is small. As the sample size increases, the test error decreases again. Training error is hardly affected by this.
Use neuralnet to model the outcome in the banknote authentication dataset that we used in previous assignments and compare its test error at several sizes of the hidden layer to that observed for the SVM and KNN approaches.
dbaDat <- read.table("data_banknote_authentication.txt",sep=",")
colnames(dbaDat) <- c("var","skew","curt","entr","auth")
cls <- as.factor(dbaDat$auth)
set.seed(1234)
#Using 4 nodes per hidden layer, as 4 nodes performed best previously;
#vary the number of hidden layers i from 1 to 10
#10 replicates per setting to see the variance of the errors
dfTmp <- NULL
for (i in 1:10){
for (j in 1:10){
bTrain <- sample(c(FALSE,TRUE),nrow(dbaDat),replace=TRUE)
train <- dbaDat[bTrain,]
test <- dbaDat[!bTrain,]
#recode the 0/1 outcome to 1/2 so that the 1.5 threshold below separates the classes
nn <- neuralnet(as.numeric(factor(auth))~var+skew+curt+entr,err.fct="sse",data=train,hidden=rep(4,i))
cm <- as.matrix(table(test$auth,1+(compute(nn, test[,1:4])$net.result>1.5)))
dfTmp <- rbind(dfTmp, data.frame(i,err= 1-(sum(diag(cm))/sum(cm))))
}
}
boxplot(err~i, data=dfTmp, xlab="Hidden Layers")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:neuralnet':
##
## compute
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
dfTmp %>% group_by(i) %>% summarize(mean=mean(err),sd=sd(err))
## # A tibble: 10 × 3
## i mean sd
## <int> <dbl> <dbl>
## 1 1 0.4405856692 0.015311929671
## 2 2 0.4471207983 0.019908265364
## 3 3 0.4425642358 0.009544485724
## 4 4 0.4465815719 0.016111468991
## 5 5 0.4497809521 0.011835899853
## 6 6 0.4417826854 0.011443185908
## 7 7 0.4441066346 0.017088008294
## 8 8 0.4513915247 0.014306502543
## 9 9 0.4542380697 0.010237581837
## 10 10 0.4431740796 0.011621295049
#dplyr masks neuralnet's compute function, so detaching
detach("package:dplyr", unload=TRUE)
#The mean error appears lowest around 2 hidden layers of 4 nodes, so that configuration is used below
set.seed(1234)
bTrain <- sample(c(FALSE,TRUE),nrow(dbaDat),replace=TRUE)
train <- dbaDat[bTrain,]
test <- dbaDat[!bTrain,]
#Test misclassification rates
testError <- function(truth, predicted){
cm <- as.matrix(table(truth,predicted))
1-(sum(diag(cm))/sum(cm))
}
nm <- neuralnet(as.numeric(factor(auth))~var+skew+curt+entr,err.fct="sse",data=train,hidden=rep(4,2))
nn.err <- testError(test$auth,
1+(compute(nm, test[,1:4])$net.result>1.5))
plot(nm)
library(class)
#exclude the outcome column from the KNN predictors
km <- knn(train[,1:4],test[,1:4],k=2,cl=cls[bTrain])
kn.err <- testError(test$auth,km)
library(e1071)
svm <- svm(as.factor(auth)~var+skew+curt+entr, data=train, kernel="polynomial", gamma=.2, cost=20)
svm.err <- testError(test$auth,predict(svm,test[,-5]))
#Testing against best performing KNN and SVM models from Week 11
library(knitr)
kable(data.frame(nn.err, kn.err, svm.err))
| nn.err | kn.err | svm.err |
|---|---|---|
| 0 | 0 | 0.0102639296 |
#The performance of neuralnet seems highly sensitive to the initial seed provided, similar to KNN. However, on this test set nn and knn performed equally well, with SVM performing slightly worse.